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>  The  use  of  graphical  methods  for  diagnostic  purposes  has  an  honorable  tradition  which  is 


rooted  in  the  pioneering  work  of  Anscombc  and  Tukey  and  has  been  developed  b\  Wilk 

L  — 1-  /■ 

Gnanadesikan.  and  a  host  of  others  associated  with  Bel!  Laboratories.  Thr—rre 
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Landwehr.  Pregibon,  and  Shoemaker  (subsequently  referred  to  as  1  PS)  attempts  to  modify  and 
extend  graphical  diagnostic  displays  that  have  been  developed  for  ordinary  regression  to  be  of 
use  for  assessing  logistic  regression  models  for  binary  data.  They  propose  displays  for  each  of 
the  three  key  components  of  regression  diagnostics:  goodness  of  fit.  outlier  detection,  and 
model  specification.  Theirs  is  a  pioneering  effort  and  many  useful  ideas  have  emerged  from 


iL^We  congratulate  the  authors  on  the  substantial  progress  they  have  made  to  date. 


—  Somewhat  fuzzy  analogies  to  linear  regression  arc  not  sufficient  to  motivate  for  us  the 
approaches  adopted  by  LPS.  Thus  yt c  have  attempted  to  examine  critically  LPS  s  diagnostic 
displays  to  see  if  could  determine  why  in  each  instance  the  method  works  or  fails.  LI’S 


suggest  that  the  major  obstacle  in  carrying  linear  regression  diagnostics  over  to  the  logistic 


regression  setting  is  the  discreteness  of  binary  data.  While  discreteness  may  well  be  a  serious 
problem,  yt  note  additional  ones^  Although  our  examination  is  far  from  complete,  we  hope  it 


will  be  a  useful  supplement  to  the  present  paper. 


We  deal  separately  with  each  diagnostic  display. 


1.  LOCAL  MEAN  DEVIANCE  PLOTS 

The  key  idea  behind  this  plot  is  to  focus  not  on  a  global  measure  of  goodness-of-fit  but 
rather  on  local  contributions  to  the  fit.  The  approach  is  based  on  an  analogy  with  the  linear 
regression  problem  with  replicated  observations  where  we  can  partition  the  sum  of  squared 
residuals  (SS)  into  a  pure-error  SS  and  a  lack-of-fil  SS.  LPS  claim  that  for  the  logistic 
regress  on  setting  that,  "if  there  are  exact  replicates  in  the  data,  the  pure-error  component  of 
the  deviance  is  easily  obtained.”  This  is  true  in  a  sense,  but  it  is  somewhat  deceptive. 


As  in  the  case  of  the  linear  regression  problem,  when  we  have  exact  replicates  in  an 
observational  study  the  definition  of  pure  error  depends  on  two  assumptions:  (i)  independence 


of  observations,  and  (ii)  a  correct  choice  of  the  explanatory  variables  for  inclusion  in  the 
model  (but  not  necessarily  the  functional  form).  Independence  is  a  major  problem  in  many 
applications  although  it  does  seem  reasonable  in  LPS’s  major  example.  The  specification  of 
relevant  explanatory  variables  is  more  serious  but  we  have  no  advice  for  dealing  with  it  except 
to  proceed  conditionally,  which  is  in  effect  what  LPS  do  with  their  example.  Thus  despite  our 
note  of  caution,  we  find  the  idea  of  using  a  pure  error  component  appealing. 

Now.  LPS  are  dealing  with  the  case  where  there  are  not  necessarily  exact  replicates,  and  they 
do  not  really  attempt  to  partition  the  deviance.  What  then  are  they  doing?  Through  the  use 
of  their  clustering  algorithm,  in  effect  they  are  partitioning  the  factor  space  (i.c.  the  m 
dimensional  space  of  explanatory  variables)  into  K  distinct  regions.  Then  they  attempt  to 
check  whether  the  logistic  regression  in  each  region  is  the  same  as  the  global  logistic  regression 
by  fitting  the  model 

logit  (p)  =  Zy  +  Xfi.  (1.1) 

where  =  1  or  0  according  to  whether  the  ith  observation  is  or  is  not  in  the  kth  region. 
We  wish  to  determine  if  the  model  is  the  same  in  all  regions,  i.e.  y  =  0  or 

logit  (p)  =  Xfi,  (1.2) 

and  we  can  test  for  this  directly  using  a  conditional  likelihood  ratio  test  of  model  (1.2)  versus 
model  (1.1).  If  N  is  large  relative  to  K  such  a  test,  and  if  the  regions  were  preformed,  the 
conditions  of  Haberman  (1974)  would  seem  to  be  satisfied  and  a  X2  reference  distribution  with 
K-l  degrees  of  freedom  would  be  appropriate.  (Tsiatis,  1980,  proposes  a  similar  test  but  uses  a 
Wald-like  quadratic  form  statistic.)  LPS  eschew  a  formal  test,  and  approach  the  issue 
graphically  by  examining  the  contributions  to  such  a  conditional  test  statistic  by  region  of  the 
factor  sp^ce,  in  a  cumulative  form.  This  seems  reasonable  if  we  don’t  have  preformed  groups 
of  data  points,  and  if  we  have  relied  on  a  clustering  algorithm  of  the  sort  suggested  by  LPS. 

How  well  does  the  LPS  graphical  approach  work?  We  can  explore  the  answer  to  this 
question  best  in  the  case  where  each  cluster  or  region  of  the  factor  space  consists  of  exact 
replicates.  As  Jennings  (1982)  notes,  under  the  null  hypothesis  of  model  (1.2),  the  expectation 


of  the  local  mean  square  deviance  for  the  kih  group.  should  be  approximately  the 

same  as  the  expectation  of  the  global  mean  deviance.  D  Jennings  has  carried  out  calculations 
for  some  simple  examples  that  suggest  that  the  local  mean  deviance  overestimates  the  global 
mean  deviance  for  p  near  0.5  and  underestimates  it  for  p  near  0  or  1.  Thus  the  order  in 
which  the  groups  arc  added  to  LPS's  running  estimate  of  local  mean  deviance  may  have  a 
substantial  impact  on  what  we  see  in  the  graphical  display.  As  a  consequence  we  are  unable  to 
share  LPS's  enthusiasm  for  this  display. 

A  simple  alternative  to  the  LPS  approach,  and  one  we  have  found  successful  in  practice,  is 
suggested  by  the  preceding  discussion.  Convert  each  of  the  m  explanatory  variables  into  sets 
of  categories,  and  restructure  the  data  in  the  form  of  an  (m+l)-dimcnsional  contingency  table 
with  an  m-dimensional  fixed  margin  (Bishop.  Ficnberg  and  Holland,  1975:  Fienberg,  1980). 
Now  examine  the  fit  of  the  logistic  regression  model  using,  say.  the  means  of  the  explanatory 
variables  in  each  cell.  To  examine  local  variation  we  can  use  the  generalized  residual  approach 
of  Habcrman  (1976).  This  contingency  table  structuring  can  also  be  used  to  explore  directly 
nonlinear  effects  of  the  explanatory  variables  and  interactions. 

2.  EMPIRICAL  PROBABILITY  PLOTS 

Both  this  plot  and  the  next  one  involve  the  adaptation  of  the  linear  regression  notion  of 
standardized  residuals.  Here.  LPS  begin  by  arguing  for  the  use  of  the  deviance  contribution, 
d  =  d(p;y),  standardized  by  its  approximate  standard  error.  We  sec  little  justification 

i  i  i 

considering  the  use  of  a  reference  distribution  in  this  setting  given  that,  for  extreme  values 
of  p,  d*  lends  to  a  2-point  distribution  and  Habcrman’s  (1974)  conditions  are  not  met  (e.g.  see 
the  discussion  in  Jennings,  1982). 

LPS's  alternative  is  to  use  the  residuals  y  -  p  and  a  simulation  procedure.  The  procedure 
here  is  evocative  of  one  proposed  by  Atkinson  (1981.  1982).  in  which  he  presents  half  normal 
plots  of  jackknife  residuals  and  a  modified  version  of  Cook’s  distance  statistic,  using  19 
samples  simulated  with  random  normal  y's  and  a  matrix  of  explanatory  variables  the  same  as 


4 

that  of  the  data.  Because  both  statistics  that  Atkinson  uses  are  functions  of  least  squares 
residuals  divided  by  an  estimate  of  scale,  the  values  of  the  parameters  of  the  linear  model  in 
the  simulation  do  not  matter.  This  is  not  the  case  in  LPS's  simulation  for  their  empirical 
probability  plot  Even  if  they  had  standardized  their  residuals  the  values  of  the  parameters  in 
the  logistic  regression  model.  /}.  do  matter.  LPS  attempt  to  finesse  this  problem  by  using  p 

i 

in  place  of  p  for  the  simulated  Bernoulli  variates,  but  this  simply  disguises  the  dependence  of 
the  simulation  on  the  p..  Thus  we  are  skeptical  about  attaching  the  usual  interpretation  to  the 
confidence  coefficients  used  throughout  Section  4  of  LPS. 

In  the  concluding  discussion.  LPS  mention  that  the  simulations  for  the  empirical  probability 
plots  are  in  the  same  spirit  as  the  bootstrap.  In  fact  the  simulations  used  to  get  the 

distribution  of  the  ordered  residuals  are  a  form  of  bootstrap.  We  illustrate  the  bootstrap 

argument  using  LPS's  Example  3.  We  assume  that  (X(,Y  ) .  *Xioo'Yioo^  arc  independent  and 

identically  distributed  (iid)  from  some  distribution  F.  where  F  specifies  that  Y1X  is  Bernoulli 

with  probability  of  success  P(X)  such  that  logit  P(X)  =  X  /?.  Then  we  form  the  residuals  Y 

-  Y  where  logit  Y  =  X  %  If  we  knew  F.  we  could  in  principle  simulate  to  get  the 

distribution  of  the  ordered  residuals  under  the  assumed  model  F.  Not  knowing  F.  wc  substitute 
an  estimate.  The  appropriate  estimate  here  is  F  which  specifies  that  the  probability  of  success 
is  Y.  We  comment  further  on  the  use  of  the  bootstrap  in  the  next  section. 

3.  PARTIAL  RESIDUAL  PLOTS 

In  the  case  of  ordinary  linear  regression,  collincarity  in  the  design  space  can  hide  a  sought- 
after  relationship  from  partial  residuals.  More  specifically,  if  the  underlying  model  is 

E(Y)  =  X  fi  ♦  g(z),  (3.1) 

then  partial  residuals  fail  to  detect  the  part  of  g(z)  which  is  in  the  column  space  of  X.  We 
expect  similar  problems  to  hold  in  the  logistic  regression  case. 

LPS  motivate  their  definition  of  partial  residuals  by  considering  an  analogy  to  ordinary  linear 
regression.  Here,  we  first  give  a  simple  calculation  to  show  why  partial  residuals  sometimes 


work,  and  then  we  give  an  example  to  show  when  partial  residuals  can  fail. 


Suppose  Y  is  Bernoulli  with  probability  of  success  P(z)  such  that 

logit  P(z)  =  a  +  g(z), 
where  z  is  real.  We  fit  the  logistic  model 

logit  P(z)  =  +  y  z. 

Linear  logistic  regression  estimates  P  in  (3.3)  when  what  we  are  after  is  P  in  (3.2). 
we  define  the  residual  logit 


P  1  -  P 

r  =  logit  P  -  logit  P  =  log  (  -  •  — - — L  \. 

l.nn  °  i  \1-P  P  / 


Using  the  first  order  Taylor  approximation  for  the  logarithm,  we  have 

r  =  logit  P  -  logit  P 


lopl 
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The  approximations  hold  if  the  ratios 


P  1-P 

-  *s  1,  - es  1. 

P  1-P 


Solving  for  g(z)  in  (3.2)  and  (3.3),  we  get 


P-P 


=  r  +  r  z  =  n  Tn ni  +  r  z- 


If  we  define 


p  (i-p) 


c<z)  =  pTFp)  *  »  z' 


(3.2) 

(3.3) 
Suppose 

(3.4) 


(3.5) 

(3.6) 

(3.7) 


(3.8 


then 
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E(G(z)!z)  ^  g(z).  (3.9) 

Substituting  estimates  for  P  and  y  in  (3.8)  gives  r  as  defined  by  LPS.  This  simple 

I  par 

calculation  shows  that  when  (3.6)  holds,  the  conditional  expectation  of  r  given  z  is 

par 

approximately  g(z).  When  z  is  discrete  and  the  number  of  observations  at  each  possible  value 
of  z  is  large,  we  can  average  the  partial  residuals  at  each  value  of  z  to  get  an  estimate  of 
g(z).  When  z  is  continuous,  we  cannot  take  averages  at  each  possible  value  of  z,  but  we  can 

smooth. 


We  now  give  a  counterexample  in  which  LPS’s  partial  residual  display  fails.  Suppose  logit 
P(z)  =  log(l+z)  where  the  possible  values  of  z  are 


It  turns  out  that 


0,  2.  4,  6,  8.  10,  20.  30,  40,  50. 


logit  P  =  a  +  y  z, 

where  a  -  0.85  and  y  =  0.10.  Figure  1  shows  a  scatterplot  of  G(z).  For  each  z,  G(z)  can 
take  two  possible  values  depending  on  whether  Y=1  or  Y=0.  At  these  two  values,  the  number 
of  dashes  reflect  the  true  proportions  of  successes  P(z)  and  failures  l-P(z).  We  may  think  of 
Figure  1  as  the  partial  residual  plot  gotten  by  looking  at  an  infinite  number  of  observations  at 
each  z.  Comparing  E(G(z)!z)  and  g(z)  =  log(l+z)  -  a,  we  see  that  partial  residuals  fail  when  z 
exceeds  30.  In  Figure  2,  we  calculate 


P  1-P 

—  and  — i  , 

P  1-P 

i 

for  z  =  30,  40.  50,  and  we  see  that  for  these  values  of  z,  the  approximation  (3.6)  simply  does 


not  hold. 


I 


Our  final  comment  on  partial  residuals  is  one  on  assessment.  In  the  breast  cancer  example, 
LPS  examine  numerous  partial  residual  plots  to  obtain  a  rather  complicated  model  involving  7 
parameters.  Assessing  the  fit  of  this  model  is  a  difficult  but  important  problem.  There  are 
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two  questions  we  might  ask.  if  new  patients  were  observed,  would  the  model  afford  a  good 
rule  for  predicting  their  outcome?  If  the  experiment  were  performed  again  with  a  new  sample 
of  306  patients,  would  LPS  conclude  the  same  model?  If  we  could  automate  the  thought 
processes  that  produce  say,  the  conclusion  that  LPS's  Figure  8  shows  a  cubic  dependence,  we 
could  use  the  bootstrap  to  help  answer  these  questions.  (See  Gong  1982a.  1982b  for  the  use  of 
the  bootstrap  in  assessing  another  complicated  prediction  rule.) 

We  include  here  just  an  idea  of  how  the  bootstrap  can  help.  Look  at  LPS's  Figure  8.  If 
we  repeated  the  experiment,  observing  a  new  sample  of  306  patients,  would  we  get  a  picture 
similar  to  their  Figure  8.  and  would  LPS  arrive  at  the  same  conclusion  of  a  cubic  dependence? 
Suppose  the  data  (X  ,Y  ) (X  ,Y  )  were  iid  with  distribution  F.  If  we  knew  F.  we  could 

II  306  306 

in  principle  generate  a  new  sample  of  patients.  Since  we  don't  know  F,  substitute  an  estimate, 
the  empirical  distribution  F  which  puts  mass  1/306  at  each  observation.  The  resulting  sample 
is  a  bootstrap  sample.  Our  Figure  3  shows  the  partial  residual  smooths  of  10  bootstrap 
samples.  (Performing  25  bootstraps  gives  similar  results  but  a  more  confusing  picture.)  How  do 
we  interpret  this  picture?  The  smooth  based  on  the  original  sample  is  our  estimator  of  the 

nonlinear  relation  of  logit  P  on  age.  The  smooths  based  on  the  bootstrap  samples  tell  us  about 

the  variability  of  that  estimator.  Since  the  shape  of  the  bootstrap  smooths  all  tend  to  be 
similar,  we  have  some  confidence  in  the  original  sample  smooth  as  an  estimator  for  the  shape 
of  the  nonlinear  relation  of  logit  P  on  age. 

Figure  4  shows  the  partial  residual  smooths  of  10  bootstraps  for  the  second  covariate,  the 

year  of  surgery.  The  shape  of  these  smooths  arc  also  very  similar,  indicating  that  the  partial 

residual  smooth  of  the  original  sample,  also  given  in  Figure  i  is  a  good  estimator  for  the 
nonlinear  dependence  of  logit  P  on  year  oi  surgery.  There  is  an  interesting  decrease  in 
survival  for  surgery  during  1965. 

Figure  5  shows  the  smooths  of  10  bootstraps  for  the  third  covariate,  the  number  of  nodes. 
The  shape  of  the  smooths  for  small  number  of  nodes  remains  constant  throughout  the 
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bootstraps,  while  the  shape  for  large  number  of  nodes  is  highly  variable.  If  logit  P  does 
depend  on  number  of  nodes  through  -log(l+z),  we  have  a  situation  similar  to  our 
counterexample.  For  z  large,  the  difference  between  G(z,Y=l)  and  G(z,Y=0)  is  astronomical, 
so  that  observing  a  few  more  Y=l’s  can  pull  the  smooth  way  up.  while  observing  a  few  more 
Y=0’s  can  pull  the  smooth  down.  The  high  variability  in  shape  is  ultimately  tied  to  the  fact 
that  there  are  just  not  very  many  patients  with  large  number  of  nodes. 

4.  ADDITIONAL  COMMENTS 

Although  the  amount  of  calculation  required  to  produce  the  graphical  displays  proposed  by 
LPS  seems  at  first  blush  immense,  the  plots  can  in  fact  be  generated  with  relative  ease  using  a 
statistical  software  package  such  as  Bell  Laboratories’  "S".  This  is  a  major  virtue,  and  argues 
for  further  efforts  to  build  up  on  LPS’s  pioneering  work.  Variants  on  and  alternatives  to  their 
three  graphical  displays  can  be  explored  with  the  same  ease  as  can  the  original  displays. 

In  future  work  in  this  area  we  would  argue  for  somewhat  less  reliance  on  fuzzy  analogies  to 
linear  regression  techniques,  and  we  would  focus  more  directly  on  the  theory  that  underlies  the 
problem  at  hand.  Of  potential  importance  for  the  logistic  regression  selling  is  Jennings'  (1982) 
measure  of  inference  adequacy,  a  measure  closely  related  to  the  curvature  measures  of  Bates 
and  Watts  (1980).  His  approach  has  implications  both  for  model  specification  and  for  the 
form  of ''parametrization  of  the  logistic  regression  model  to  achieve  computational  efficiency 
and  inferential  stability. 

The  use  of  kernel-based  density  estimation  in  the  development  of  grapical  displays  for  binary 
response  models  also  bears  scrutiny  (e.g.  see  Titterington,  1980.  and  Tittcrington  et  al..  1981). 
Although  such  an  approach  has  been  advocated  as  "non-paramctric"  (e.g.  sec  Copas,  1983)  we 
see  its  major  strengths  as  (a)  the  smoothness  of  the  resulting  probabilities,  and  (b)  the  checks 
it  might  provide  for  the  examination  of  model  adequacy  and  local  variation. 

Developing  useful  graphical  displays  is,  in  general,  a  difficult  task.  LPS  have  demonstrated 
how  ideas  for  diagnostic  displays  from  ordinary  linear  regression  can  be  adapted  to  the  binary 
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Figure  1.  The  Partial  Residual  Plot  for  the  Counterexample. 

The  dashes  form  a  scatterplot  of  G(z).  The  cur\e  through  this  scatierplot  is  E(G(z)!z)  which 
estimates  the  true  function  g'z)  indicated  by  the  triangles. 

Figure  2.  Understanding  the  Failure  of  Partial  Residuals 
Condition  (3.6'  fails  to  hold  for  z  =  30.  40,  50  in  the  counterexample. 
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.993 
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.981 

.997 
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Figure  3.  Bootstraps  for  Age 

The  partial  residual  smooths  of  10  bootstrap  samples  together  with  that  of  the  original  sample 
of  the  covariate  X(  =  age. 

Figure  4.  Bootstraps  for  Year 

The  partial  residual  smooths  of  10  bootstrap  samples  together  with  that  of  the  original  sample 
of  the  covariate  =  year  of  surgery. 

Figure  5.  Bootstraps  for  Nodes 

The  partial  residual  smooths  of  10  bootstrap  samples  together  with  that  of  the  original  sample 
of  the  covariate  x  =  number  of  nodes. 
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